feature transform
Relational Self-Attention: What's Missing in Attention for Video Understanding
Convolution has arguably been the most important feature transform for modern neural networks, driving the advance of deep learning. The recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitations of stationary convolution kernels and opened the door to an era of dynamic feature transforms. Existing dynamic transforms, including self-attention, however, all fall short for video understanding, where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed relational self-attention (RSA), that leverages the rich structure of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts, achieving the state of the art on standard motion-centric benchmarks for video action recognition such as Something-Something-V1&V2, Diving48, and FineGym.
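The abstract describes the mechanism only at a high level. Below is a minimal, hypothetical NumPy sketch of the stated idea: a kernel generated dynamically from query-key relations is used to aggregate a relational context, in contrast to plain dot-product attention. This is not the paper's exact formulation; the window size, the projection matrices `W_q`, `W_k`, `W_v`, `W_r`, and the fusion of basic and relational terms are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relational_self_attention(X, W_q, W_k, W_v, W_r):
    """Conceptual sketch only (not the paper's exact formulation).

    X: (n, d) features of one local spatio-temporal neighborhood,
       with row 0 taken as the query position.
    """
    q = X[0] @ W_q                           # query vector, shape (d,)
    K = X @ W_k                              # keys,   shape (n, d)
    V = X @ W_v                              # values, shape (n, d)

    # Basic attention logits from query-key dot products.
    logits = q @ K.T / np.sqrt(K.shape[1])   # (n,)

    # Relational term: generated from elementwise query-key interactions,
    # so the kernel can reflect how features correspond across the window.
    rel = (q[None, :] * K) @ W_r             # (n, n), assumed relational map
    kernel = softmax(logits + rel.mean(axis=1))  # dynamic relational kernel

    # Aggregate values with the dynamically generated kernel.
    return kernel @ V                        # (d,)

rng = np.random.default_rng(0)
n, d = 27, 16                                # e.g., a 3x3x3 spatio-temporal window
X = rng.standard_normal((n, d))
W_q, W_k, W_v = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
W_r = 0.1 * rng.standard_normal((d, n))
print(relational_self_attention(X, W_q, W_k, W_v, W_r).shape)  # (16,)
```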
Universal Style Transfer via Feature Transforms
Universal style transfer aims to transfer arbitrary visual styles to content images. Existing feed-forward methods, while enjoying inference efficiency, are mainly limited by an inability to generalize to unseen styles or by compromised visual quality. In this paper, we present a simple yet effective method that tackles these limitations without training on any pre-defined styles. The key ingredient of our method is a pair of feature transforms, whitening and coloring, embedded in an image reconstruction network. The whitening and coloring transforms directly match the feature covariance of the content image to that of a given style image, which shares a similar spirit with the optimization of the Gram-matrix-based cost in neural style transfer. We demonstrate the effectiveness of our algorithm by generating high-quality stylized images, with comparisons to a number of recent methods. We also analyze our method by visualizing the whitened features and by synthesizing textures via simple feature coloring.
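The whitening and coloring transforms described here are standard linear algebra, so a compact sketch is possible. The NumPy code below is a minimal illustration of the covariance matching the abstract describes, assuming features are flattened to C x (H*W) matrices and eigendecomposition-based whitening; it is a sketch under those assumptions, not the authors' released implementation.

```python
import numpy as np

def whiten_color(f_c, f_s, eps=1e-5):
    """Whitening-coloring transform on flattened feature maps.

    f_c: content features, shape (C, H*W)
    f_s: style features,   shape (C, H*W)
    Returns content features whose covariance matches the style's.
    """
    # Center both feature sets.
    mu_c = f_c.mean(axis=1, keepdims=True)
    mu_s = f_s.mean(axis=1, keepdims=True)
    fc = f_c - mu_c
    fs = f_s - mu_s

    # Whitening: E diag(d)^(-1/2) E^T fc yields identity covariance.
    cov_c = fc @ fc.T / (fc.shape[1] - 1)
    d_c, E_c = np.linalg.eigh(cov_c)
    d_c = np.clip(d_c, eps, None)            # guard small/negative eigenvalues
    fc_white = E_c @ np.diag(d_c ** -0.5) @ E_c.T @ fc

    # Coloring: E_s diag(d_s)^(1/2) E_s^T imposes the style covariance.
    cov_s = fs @ fs.T / (fs.shape[1] - 1)
    d_s, E_s = np.linalg.eigh(cov_s)
    d_s = np.clip(d_s, eps, None)
    fcs = E_s @ np.diag(d_s ** 0.5) @ E_s.T @ fc_white

    # Re-center on the style mean.
    return fcs + mu_s

# Toy check: the transformed covariance matches the style covariance.
rng = np.random.default_rng(0)
f_c = rng.standard_normal((8, 1000))
f_s = 2.0 * rng.standard_normal((8, 1000))
f_cs = whiten_color(f_c, f_s)
print(np.allclose(np.cov(f_cs), np.cov(f_s)))  # True
```

In practice this transform is applied to intermediate encoder features (e.g., VGG activations) before decoding them back to an image, which is what embeds it in an image reconstruction network.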
Export Reviews, Discussions, Author Feedback and Meta-Reviews
First provide a summary of the paper, and then address the following criteria: quality, clarity, originality, and significance.

The paper introduces an efficient feature transform, local decorrelation, which, when combined with boosted (orthogonal) decision trees, considerably improves over the state of the art on pedestrian detection. Overall, it is a clearly (and nicely) written paper with good analysis, sufficient detail, and solid experiments.

Pros:
- Very well written and executed paper
- Attention to detail
- Solid results
- Straightforward and intuitive method

Cons:
- Incremental over Hariharan et al. (not major; see below)
- Since it claims "Improved Detection", as opposed to "Improved Pedestrian Detection", I would have liked to see some more results on general object detection or the like.

Going from global to local decorrelation, and doing the right analysis for the design decisions, sets the work apart.
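For context on the "local decorrelation" transform the review praises: a minimal sketch, assuming it amounts to estimating a shared covariance of small local feature patches and using its leading eigenvectors as decorrelating filters (local rather than global whitening). The patch size, number of filters, and aggregation scheme below are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def local_decorrelation_filters(feature_maps, patch=5, k=4):
    """Sketch: derive decorrelating filters from local patch statistics.

    feature_maps: (N, H, W) single-channel feature maps from many images.
    Returns k filters of shape (patch, patch): the top eigenvectors of the
    aggregated local patch covariance.
    """
    d = patch * patch
    cov = np.zeros((d, d))
    count = 0
    for fmap in feature_maps:
        H, W = fmap.shape
        for i in range(0, H - patch + 1, patch):
            for j in range(0, W - patch + 1, patch):
                p = fmap[i:i + patch, j:j + patch].ravel()
                p = p - p.mean()
                cov += np.outer(p, p)           # accumulate patch covariance
                count += 1
    cov /= max(count - 1, 1)

    # Leading eigenvectors of the shared local covariance act as
    # (approximately) decorrelating convolution filters.
    vals, vecs = np.linalg.eigh(cov)            # ascending eigenvalues
    top = vecs[:, ::-1][:, :k]                  # largest-eigenvalue first
    return top.T.reshape(k, patch, patch)

rng = np.random.default_rng(0)
maps = rng.standard_normal((10, 32, 32))
print(local_decorrelation_filters(maps).shape)  # (4, 5, 5)
```

The decorrelated responses would then feed the boosted decision trees the review mentions, letting orthogonal splits work on (locally) decorrelated features.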